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Abstract 

Matrix Distributed Processing (MDP) is a C++ library for fast development of efficient parallel al- 
gorithms. MDP enables programmers to focus on algorithms, while parallelization is dealt with auto- 
matically and transparently. Here we present a brief overview of MDP and examples of applications in 
Computer Science (Cellular Automata), Engineering (PDE Solver) and Physics (Ising Model). 

1 Introduction 

Matrix Distributed Processing (MDP) [ 1 1[2| is a collection of classes and functions written in C++ for fast 
development of parallel algorithms such as solvers for partial differential equations, mesh-like algorithms, 
and various types of graph-based problems. These algorithms find frequent application in many sectors of 
physics, engineering, electronics and computational finance. 
MDP includes: 

• a natural syntax for the algorithms that is transparent to the underlying parallelization; 

• parallelization based optimization algorithms implemented in MPI; 

• functions for linear algebra computations with support of a Maple-like syntax; 

• statistical functions; 

• fitting functions; 

• a Parallel SIMulator (PSIM) which allows MDP algorithms to run on single processor machines with- 
out MPI (uses fork for creating the processes and socket pairs for communication). 



MDP was originally developed for and still constitutes the core of FermiQCD[3|[4|, a library for Lattice 
QCD computations. Lattice QCD is a Monte Carlo based numerical approach to the physics of composite 
particles made of quarks (e.g., protons and neutrons) and it is considered one of the most computationally 
intensive projects of modern physics. 

FermiQCD, was developed by the author at the University of Southampton (UK) and Fermilab (Depart- 
ment of Energy) and it is currently used by other physics departments around the world[ 5|[6|[7|[8|[9|. 

While Lattice QCD is currently one of the main applications of MDP, its range of applicability is not 
limited to it as we show in the following examples. 
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2 Example: Parallel Game of Life 



As a first example we show here how to write, in a few lines, a parallel program to find stable configurations 
for the Game Of Life. The game works as follows: 

• It is defined on a board (N x N cells) with periodic boundary conditions. 

• Each cell can be alive (1) or dead (0). 

• The game consists of a series of iterations and, at each iteration, if a cell is dead and the number of 
beighbor cells who are alive is greater or equal 3, the cell dies; if a cell is alive and the number of 
beighbor cells who are alive is 2 or 3, the cell remains alive; otherwise the cell will be dead. 

• Given a random starting configuration for the board, we want to determine its evolution and whether 
it reaches a stable configuration. 

Here is the entire parallel code: 



00 tinclude "mdp.h" 
01 

02 const int alive=l, dead=0; // some constants 

03 

04 int rulesof game ( int aOO, int aOl, int a02, 

05 int alO, int all, int al2, 

06 int a20, int a21, int a22) { 

07 int Sum=a00+a01+a02+al0+al2+a20+a21+a22; 

08 if (all==dead && sum==3) return alive; 

09 else if (all==alive && (sum==2 | | sum==3) ) return alive; 

10 return read; 

11 } // rules of the game 
12 

13 int main (int argc, char **argv) { 

14 mdp . open_wormholes (argc, argv) ; // open communication 

15 int L [] ={ 10, 10} ; // declare board size 

16 mdp_lattice board (2, L); // declare board 

17 mdp_f ield<int> C (board) ; // declare cells 

18 mdp_f ield<int> newC (board) ; // declare new set of cells 

19 mdp_site x (board) ; 
20 

21 forallsites(x) C(x)= (board . random (x) .plain()>0.5) ? alive : dead; 
22 

23 while (1) { // game iteration 

24 C. update (); // communicate! 

25 forallsites (x) // parallel loop 

2 6 newC (x) =rulseof game (C ( (x-1) -0) , C (x-1) , C ( (x-1) +0) , 

27 C(x-0), C(x), C(x+0), 

2 8 C ( (x+1) -0) , C (x+1) , C ( (x+1) +0) ) ; 

29 int diff=0; 

30 forallsites (x) { 

31 dif f +=abs (C (x) -newC (x) ) ; // count changed cells 
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32 C (x) =newC (x) ; 

33 } 

34 if(diff==0) break; 

35 } 

36 C.saveCcells.dat") 

37 mdp . close_wormholes ( ) ; 

38 return 0; 

39 } 

• Line 00 includes the basic library. 

• Lines 04-11 define the rule of the game. The values of a cell (all) and its neighbors are passed as a 
3x3 table (aO 0-a22). The function returns the new value for al 1. 

• Line 14 opens the parallel communication channels, line 37 closes them. 

• Line 15 declares a 2D array containing the board size, L. 

• Line 16 declares a lattice object (board) that represents the set of board sites and its topology. By de- 
fault a mesh topology with toroid boundary conditions. Two arguments are passed to the constructor: 
the dimension, and size of the board. For each board site, it is possible to specify on which parallel 
process it is going to be allocated as well as alternative topologies. 

• Line 17 declares a field of cells (C) on the board (the cells that live on it). 

• Line 18 declares an auxiliary field that will be necessary for the computation. 

• Line 19 declares a x a variable that represents a generic site of the board and will be used for looping 
over the sites and therefore the cells that live on them. 

• Line 21 sets the initial configuration of the board by looping over all sites x and setting the corre- 
sponding cell C (x) to a random alive or dead. Each process loops only over the sites that are 
allocated locally by the process. Notice how MDP provides a local random number generator for 
each site of the board, board . random (x) , which is vital in order to be able to reproduce results of 
stochastic algorithms. 

• Lines 23-35 loop over the iterations of the game. 

• Line 24 is performs a critical operation; it informs MDP that the value of the field C has changed and 
parallel communication is needed to synchronize those buffers that contain copies of non-local sites 
(synchronization). 

• Lines 25-28 apply the rules of the game at each site. The new values of the states for the cells are stored 
in the auxiliary field newC. Each process loops over the local sites only. For each local site (x), x+0 
represents the neighbor site when coordinate is incremented by one, x-0 represents the neighbor 
site when coordinate is decremented by one, x+1 represents the neighbor site when coordinate 1 is 
incremented by one, etc. 1 . 

• Lines 29-33 perform two operations in a single parallel loop: the new states for the cells (newC) are 
copied back into C ; the number of cells that have changed state are counted and the number is stored 
in dif f. 

'This notation may appear bizzarre but it is nothing more than a sum ("x* + i ) or difference — i ) of vectors where "a? is 
a vector that represents a site on the board and the integer i represents a unit verson in direction i. 



II store new cells 

// exit if cells didn't change 

// save cells 
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• Line 34 terminates the iteration if no cell has changed its state. 

• Line 36 saves the board configuration in a file cells . dat. 



Most of the parallel work is done by the constructor (mdp_lattice) of the board which, from the 
board topology, determines how to partition it and determines the optimal communication patterns; by the 
constructor (mdp_f ield) of the field of cells (C) which allocates the local cells and the buffers to store 
copies of non-local cells; and by the method update which performs communication to copy remote cell 
values into the local buffers. 

We'll show in the next few examples that is equally easy to implement fields of any class of objects. In 
the next example we'll consider a field of matrices. 



3 Example: Parallel PDE Solver 

Consider here, as a different example that presents many similarities with the one above, the following 
Laplace equation: 

V\{x) = f{x) (1) 

where ip(x) is a field of 2 x 2 complex matrices defined on a 3D space (space), x = (xo, x\,X2) limited 
by < Xi < Li, and 

L = {10,10,10} (2) 
/(x) = Asin(2vr3;i/Li) 

'■(!!) 

The initial conditions are <£i n itiai{%) = 0. We will also assume that Xj + Li = X{ (torus topology). 

Solution: In order to solve eq. (0 we first discretize the Laplacian (V 2 = <9 2 + df + 3f ) and rewrite it 

as 

[<p(x + fi)-2<p(x) + <p(x-Ji)] = f(x) (3) 

0=0,1,2 

where p, is a unit vector in the discretized space in direction fi. Hence we solve for <p(x) and obtain the 
following recurrence relation 

, x Eii=o,i,2 Vp( x + V-) + v( x - V)\ - f( x ) 

<p{x) = (4) 

o 

The following is a typical MDP program that solves eq. ([U by recursively iterating eq. @. Notice how 
the program is parallel but there are no explicit calls to communication functions: 

00 tinclude "mdp.h" 
01 

02 void main(int argc, char** argv) { 

03 mdp . open_wormholes (argc, argv) ; // open communications 

04 int L [ ] ={ 10, 10, 10 } ; // declare volume 

05 mdp_lattice space (3, L); // declare lattice 

06 mdp_site x (space); // declare site variable 

07 mdp_matrix_f ield phi ( space, 2 , 2 ) ; // declare field of 2x2 

08 mdp_matrix A(2,2); // declare matrix A 
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9 A(0,0)=1; A(0,1)=I; 

10 A (1,0) =3; A(l,l)=l; 

11 forallsites (x) // loop (in parallel) 

12 phi(x)=0; // initialize the field 

13 phi . update () ; // communicate! 
14 

15 for(int i=0; K1000; i++) { // iterate 1000 times 

16 forallsites (x) // loop (in parallel) 

17 phi (x) = (phi (x+0) +phi (x-0) + 

18 phi (x+1) +phi (x-1) + 

19 phi (x+2) +phi (x-2) - 

20 A*sin(2.0*Pi*x(l)/L[l]) ) / 6 ; // the equation 

21 phi . update () ; // communicate! 

22 } 

23 phi . save (" field_phi . mdp" ) ; // save field 

24 mdp . close_wormholes ( ) ; // close communications 



25 } 
Notes: 

• Line 04 declares the size of the box used to approximate the space L = {Lq, Li, L2}. 

• Line 05 declares a 3-dimensional lattice, called space, on the box L. MDP supports up to 10- 
dimensional lattices. By default, a lattice object is a mesh with torus topology. 

• Line 06 declares a site variable site x that will be used to loop over the lattice. 

• Line 07 declares a field of 2 x 2 matrices, called phi, over the lattice space. 

• Lines 08-10 define the matrix A. 

• Lines 11-12 initialize the field phi. Notice that phi is distributed over the parallel processes and 
forallsites is a parallel loop. 

• Line 13 performs synchronization. 

• Lines 15 through 23 perform 1000 iterations to guarantee convergence. 

• Line 16 loops over all sites in parallel. 

• Lines 17 through 20 implement eq. ©. Notice the similarity in notation. Here phi (x) is a 2 x 2 
complex matrix. 

• Line 21 performs synchronization. 

• Line 23 saves the field. Notice that all fields, including user-defined ones, inherit save and load 
methods from the basic mdp_f ield class. 
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4 Example: Parallel Ising Model 



As one more example of usage of MDP we report here a simple program for the Ising model. 

00 tinclude "mdp.h" 

01 void main(int argc, char** argv) { 



02 mdp . open_wormholes (argc, argv) ; 

3 int L[]={100}; 

04 mdp_lattice line(l,L); // declare the lattice 

05 mdp_f ield<int> spin (line); // declare the spin variables 
6 mdp_site x(line); 

07 int dE=0, M=L[0], dM=0; // E for Energy, M for Magnetization 

08 float kappa=2.0; // inverse temperature 

09 forallsites (x) spin(x)=+l; // set initial conditions 

10 while (1) { 

11 dM=0; 

12 for (int parity=EVEN; parity<=ODD; parity++) { 

13 forallsitesof parity (x, parity ) { 

14 dE=2*spin (x) * ( spin (x-0 ) +spin (x+0 ) ) ; // compute energy variation 

15 if (exp (-kappa*dE) >mdp_random. plain () ) // Monte Carlo accept-re ject 

16 { spin (x) *=-l; dM=dM+2*spin (x) ; } 

17 } 

18 spin . update (parity ) ; // communicate 

19 } 

2 mdp. add (dM) ; 

21 M=M+dM; // compute new value for the energy 

22 mdp << "magnetization=" << M << endl; 

23 } 

24 mdp . close_wormholes ( ) ; 



25 } 

In this example: 

• Lines 3-4 declare a ID lattice of 100 points (line). 

• Line 5 declares a field of integers (spin) on the lattice. 

• Line 7 sets the total magnetization M for this initial spin configuration. 

• Line 9 sets the initial configuration: all field variables equal to 1. 

• Line 14 computes the energy variation (dE) of each Markov Chain Monte Carlo (MCMC) step. 

• Lines 15-16 perform the Monte Carlo accept-reject. If a change is accepted the spin at site x is flipped 
and the total magnetization M changes (line 16). 

Note how at each MCMC step, first the code tries to flip the spins at even locations then, after it updates 
the lattice sites, it tries to flip the spins at odd locations (line 13). This guarantees computation results are 
independent on parallelization of the lattice line. 

Since this even-odd distinction is common in many lattice algorithms, MDP stores all even lattice sites 
and all odd lattice sites close together. This speeds up loops over one of the two subsets and also speeds up 
communication. In fact, in this example, we are able to limit the synchronization (update) to the site of a 
given parity (line 18). 
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5 Linear Algebra and Other Tools 



MDP includes a Linear Algebra package and other tools. Some of the most important classes are: 

• class mdp_real, that should be used in place of float or double; 

• class mdp .complex, for complex numbers; 

• class mdp_array, for vectors and/or multidimensional tensors; 

• class mdp_matrix, for any kind of complex rectangular matrix; 

• class mdpjneasure, for error propagation; 

• class mdp_ jackboot, a container for jackknife and bootstrap algorithms. 

The most notable difference between our linear algebra package and other existing packages is its natural 
syntax. 

For example: 

mdp_matrix A, B; 
A=Random. SU (7) ; 

B=exp (A) +inv (A) *herm±t±an (A) +5; 

reads like 
A and B are matrices 

A is a random SU(7) matrix 

B = e A + A~ l A H + 5 • 1 
Note that each matrix can be resized at will and is resized automatically when a value is assigned. 
MDP includes functions for fitting such as the Levenberger-Marquardt algorithm. 

6 Lattice, Site, and Field 

An mdp_lattice is the class that describes the space on which fields are defined; it stores the topology of 
the space (by default that of a torus in d dimensions) and information about partioning of the space over the 
parallel processes. In MDP, a lattice is a graph, defined as a collection of points (lattice sites) connected by 
links (they specify the topology). Each site is uniquely mapped to one of the parallel processes. 

The only restriction is that the graph must have a degree less than 20. From now on we will assume the 
default topology of a torus; therefore the lattice should be thought of as a mesh in d < 10 dimensions. 

The constructor class mdp_lattice determines on which process to store each site, determines the 
neighbors of each site, and determines the sizes of the buffers where each process keeps copies of those sites 
that are non-local but are neighbors of the local sites. 

The constructor also allocates a parallel random number generator so that each site of the lattice has its 
own independent random number generator. This is important for parallel Monte Carlo applications of MDP 
and ensures reproducibility of computations on different architectures. 

On each lattice it is possible to allocate fields. Some fields are built-in, for example 

mdp_comp 1 e x_f i e 1 d 

i.e. the field of complex numbers. The user can declare any type of field. For example a field of 5 float 
per lattice site: 
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class S { 
public: float S [5] ; 

}; 

int L [ ]={10, 10, 10} ; 
mdp_lattice cube (3, L); 
mdp_f ield<S> psi (cube) ; 

This code declares alO x 10 x 10 lattice (cube) and a field (psi), that lives on the cube. The site 
variables of psi, psi (x) belong to class S (assuming x is an mdp_site on the cube). 
User-defined fields can be saved, loaded, and synchronized 

psi . save ( "filename " ) ; 
psi . load ( "filename " ) ; 
psi . update ( ) ; 

Synchronization means that all processes will perform MPI communications to make sure all buffers 
that contain copies of non-local site variables are updated with the proper values. The method update should 
be called immediately after the local site variables of a field have been changed. 

Once a field object is declared in the field constructor, each process dynamically allocates memory for 
the buffers that store the copies of those sites that are non-local, but are neighbors of the local sites. These 
buffers are created in such as way to ensure optimal communication patterns. 

Every time a field changes, for example in a parallel loop such as 

f orallsites (x) phi(x)=0; 

the program notifies the field that its values have been changed by calling 

phi . update ( ) ; 

The method update performs all required communication to copy site variables that need to be syn- 
chronized between each couple of overlapping processes. 

Lattice sites are represented by objects of class mdp_site. Site objects can be looped over in parallel 
loops (such as forallsites) but it also possible to explicitly address a specific site by specifying the 
site coordinates. Obviously, only the process that stores a site locally should address a specific site. Class 
mdp_site has methods to check if a site is local, if it is non-local but a local copy is present, and which 
process stores the site locally. 

7 Optimal Communication Patterns 

In MDP, the lattice objects, according to the lattice topology and the parallel partitioning, determine the 
optimal way to store site variables in memory and performing parallel communication. This information is 
then used by the field method update that performs the synchronization of the field variables. 

Note that MDP does not attempt to overlap computation and communication. By "optimal communica- 
tion" pattern we mean that, under the assumptions below, the method update minimizes network traffic and 
data copies. 

Current optimizations are based on the assumption that each processing node has one and only one 
network card and that the network is isotropic (latency and bandwidth for each couple of nodes is the same). 
This assumption is generally true for Ethernet and Myrinet clusters. 

Communications are optimal in the sense that: 
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• Each process retrieves all non-local site variables in a single send/recv for each process that contains 
sites which are neighbors of local sites. 

• Two processes that do not store neighbor sites do not communicate with each other. 

• No process is involved in more than a single send and a single receive at one time. 

• Each process stores close in memory those copies of non-local site variables which are local to the 
same process. In this way synchronization does not require the use of additional buffers receiving 
buffers. 

This technque is particularly efficient for algorithms that only require next-neighbor communication and 
run on a all-to-all network topology such as Ethernet or Myrinet. 

It is possible, in principle, to change the above communication patterns to optimize communication for 
other network topologies. 

Although communications are currently based on MPI, they do not make use of communication tags. 
It is therefore possible, in principle, to speed-up communication by using a faster tagless and bufferless 
protocol such as Myricom GM. 

Our communication patterns have the effect of making communication almost insensitive to network 
latency, and communication speed is dominated by network bandwidth. Benchmarks are very much appli- 
cation dependent since parallel efficiency is greatly affected by the lattice size, by the amount of computation 
performed per site, processor speed and type of interconnection. In many typical applications, like the one 
described in the preceding example, the drop in efficiency is less than 10% up to 8 nodes (processes) and 
less than 20% up to 32 (our tests are usually performed on a cluster of Pentium 4 PCs (2.2GHz) running 
Linux and connected by Myrinet). 

8 MDP and PSIM 

For portability reasons, MDP is based on MPI. Nevertheless it is desirable to be able to run, test and de- 
bug MDP programs on a single node (with single or multi processor architecture) without having to install 
MPI. The latest version of MDP includes a Parallel SIMulator (PSIM). Despite the name, this is not quite a 
simulator but an emulator, i.e., a message passing library that uses local (unix/posix) socket pairs. PSIM is 
Objected Oriented and is not based on MPI. 

When compiling with PSIM, the parallel processes are created at start-up by forking. The number of 
parallel processes is specified at runtime by passing the following command line argument to any MDP 
executable program, 

-PSIM_NPROCS=4 

(this makes 4 parallel processes). 

PSIM also creates a communication log that can be used for debugging. MDP with PSIM has been tested 
on Linux, Mac and Windows (with cygwin). 

For single processor node, using PSIM does not introduce any speed-up, but, for a small number of 
processes (2-16) it does not slow down the code either. For multi-process shared memory architecture, 
parallelization of PSIM should produce a speed-up comparable with MPI. We have not been yet performed 
such tests. 

Moreover, PSIM should perform well on openMosix clusters as soon as openMosix starts supporting 
migratable sockets since all communication between the parallel processes will be done by the operating 
system. Unfortunately openMosix does not support migratable sockets yet. 
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9 Conclusions 



MDP is an easy, powerful, and reliable tool for developing efficient parallel numerical applications. We have 
shown here examples in Computer Science (Cellular Automata), Mathematics (PDE solver) and Physics 
(Ising model). 

MDP enables the programmer to focus on algorithm design while parallelization issues are dealt with 
automatically in a way transparent to the programmer. 

The underlying communication functions are written in MPI but it is possible also to compile it without 
MPI using the built-in PSIM emulator which enables one to run parallel processes on a single node and/or 
single processor architectures, such as a PC. This is useful for testing and debugging purposes. 

All of the features here described are fully functional and have been tested in real-life applications such 
as FermiQCD, developed by the University of Southampton (UK) and Fermilab (Department of Energy). 
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Code download 

MDP, PSIM and FermiQCD are currently distributed together and can be downloaded from: 

http : //www . f ermiqcd . net 
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